Topic Set Size Design with Variance Estimates from Two-Way ANOVA
Author
Abstract
Recently, Sakai proposed two methods for determining the topic set size n for a new test collection based on variance estimates from past data: the first method determines the minimum n to ensure high statistical power [22], while the second method determines the minimum n to ensure tight confidence intervals [23]. These methods are based on statistical techniques described by Nagata [15]. While Sakai [22] used variance estimates based on one-way ANOVA, Sakai [23] used the 95% percentile method proposed by Webber, Moffat and Zobel [38]. This paper reruns the experiments reported by Sakai [22, 23] using variance estimates based on two-way ANOVA [17], which turn out to be slightly larger than their one-way ANOVA counterparts and substantially larger than the percentile-based ones. If researchers choose to “err on the side of over-sampling” as recommended by Ellis [10], the variance estimation method based on two-way ANOVA and the results reported in this paper are probably the ones they should adopt. We also establish empirical relationships between the two topic set size design methods, and discuss the balance between n and the pool depth pd using both methods.
Similar resources
Classroom Simulation: Understanding One-way Random-effect Anova
The one-way random-effect ANOVA model is presented, and two simulated datasets are analyzed and discussed from three points of view: (1) The standard ANOVA table, F test, and method-of-moments estimates of variance components, which can lead to negative estimates. (2) Maximum likelihood estimates of variance components. (3) Bayesian probability intervals for variance components based on flat p...
On Estimating Variances for Topic Set Size Design
Topic set size design is a suite of statistical techniques for determining the appropriate number of topics when constructing a new test collection. One vital input required for these techniques is an estimate of the population variance of a given evaluation measure, which in turn requires a topic-by-run score matrix. Hence, to build a new test collection, a pilot data set is a prerequisite. Re...
Comparison of sediment grain size analysis among two methods and three instruments using environmental samples
Sediment grain size is measured using a variety of methods, but comparisons of measurement methods on environmental samples are limited. Three instruments (Coulter LS230, Horiba LA900, and SediGraph 5100) utilizing two fundamentally different operating principles were employed to measure a single set of 20 different sediment samples collected at shelf depths from the Southern California Bight. ...
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Short Text Conversation (STC) is a new NTCIR task which tackles the following research question: given a microblog repository and a new post to that microblog, can systems reuse an old comment from the repository to satisfy the author of the new post? The official evaluation measures of STC are normalised gain at 1 (nG@1), normalised expected reciprocal rank at 10 (nERR@10), and P, all of whic...
Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths
IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users’ preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-ca...